Extracting Collocations from syntactically annotated biomedical Corpora
Abstract
This thesis investigates the extraction of frequently used phrases (so-called collocations) from biomedical texts. It first introduces the extraction of uninterrupted collocation candidates. For interrupted candidates, whose components are separated by gaps, it develops a new technique based on suffix tries that iteratively extends smaller frequent patterns. This considerably reduces the computational cost compared with previous approaches to the task. Extraction is then extended to syntactically annotated biomedical corpora, which carry additional information for each word (e.g. base form and syntactic category). This enables the discovery of patterns that contain wildcards for specific attributes. Finally, several common approaches to scoring collocation candidates are explained, and their applicability to the candidates extracted in the experiments is evaluated.
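The idea of growing interrupted candidates out of smaller frequent patterns can be illustrated with a minimal sketch. It is not the suffix-trie method of the thesis: it uses plain dictionaries and a level-wise search, and it assumes that a pattern can only be frequent if its prefix is frequent. The function name `frequent_patterns`, the `GAP` symbol, and the parameters `min_count`, `max_words`, and `max_gap` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

GAP = "*"  # hypothetical wildcard: stands for exactly one skipped token


def frequent_patterns(sentences, min_count=2, max_words=3, max_gap=1):
    """Return {pattern: count} for frequent patterns with optional gaps.

    A pattern is a tuple of words and GAP symbols. Each level extends the
    frequent patterns of the previous level by one word on the right,
    optionally skipping up to max_gap tokens in between.
    """
    # Level 1: record where every single word occurs as (sentence, position).
    occurrences = defaultdict(list)
    for s, sentence in enumerate(sentences):
        for i, word in enumerate(sentence):
            occurrences[(word,)].append((s, i))

    # Keep only frequent words; infrequent ones are never extended.
    level = {p: o for p, o in occurrences.items() if len(o) >= min_count}
    result = {p: len(o) for p, o in level.items()}

    for _ in range(max_words - 1):
        extended = defaultdict(list)
        for pattern, positions in level.items():
            for s, end in positions:
                sentence = sentences[s]
                for gap in range(max_gap + 1):
                    j = end + gap + 1  # index of the word to append
                    if j < len(sentence):
                        new_pattern = pattern + (GAP,) * gap + (sentence[j],)
                        extended[new_pattern].append((s, j))
        # Prune: only frequent patterns survive to the next level.
        level = {p: o for p, o in extended.items() if len(o) >= min_count}
        if not level:
            break
        result.update((p, len(o)) for p, o in level.items())
    return result


if __name__ == "__main__":
    corpus = [
        "protein binds to receptor".split(),
        "protein strongly binds to receptor".split(),
        "protein binds receptor".split(),
    ]
    for pattern, count in sorted(frequent_patterns(corpus).items()):
        print(" ".join(pattern), count)
```

Because every extension starts from a pattern that is already frequent, rare words such as "strongly" in the toy corpus are never used as the starting point of a longer candidate, which is where the saving over enumerating all gapped combinations comes from.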
Similar resources
Extraction of Multi-Word Collocations Using Syntactic Bigram Composition
This paper presents a method for extracting multi-word collocations (MWCs) from text corpora, which is based on the previous extraction of syntactically bound collocation bigrams. We describe an iterative word linking procedure which relies on a syntactic criterion and aims at building up arbitrarily long expressions that represent multi-word collocation candidates. We propose several measures ...
FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora
We introduce FipsCoView, an on-line interface for dictionary-like visualisation of collocations detected from parallel corpora using a syntactically-informed extraction method.
ConFarm: Extracting Surface Representations of Verb and Noun Constructions from Dependency Annotated Corpora of Russian
ConFarm is a web service dedicated to the extraction of surface representations of verb and noun constructions from dependency-annotated corpora of Russian texts. Currently, the extraction of constructions with a specific lemma from SynTagRus and the Russian National Corpus is available. The system provides a flexible interface that allows users to fine-tune the output. Extracted constructions are grouped...
Extracting Collocations from Text Corpora
A collocation is a habitual word combination. Collocational knowledge is essential for many tasks in natural language processing. We present a method for extracting collocations from text corpora. By comparison with the SUSANNE corpus, we show that both high precision and broad coverage can be achieved with our method. Finally, we describe an application of the automatically extracted collocati...
Retrieving Collocations by Co-occurrences and Word Order Constraints
In this paper, we describe a method for automatically retrieving collocations from large text corpora. The method retrieves collocations in two stages: 1) extracting strings of characters as units of collocations, and 2) extracting recurrent combinations of strings, in accordance with their word order in a corpus, as collocations. Through the method, a wide range of collocations, especially...
Publication date: 2002